Gaussian graphical models with sparsity in the inverse covariance matrix are of
significant interest in many modern applications. For the problem of recovering
the graphical structure, information criteria provide useful optimization objectives
for algorithms searching through sets of graphs or for selection of tuning parameters
of other methods such as the graphical lasso, which is a likelihood penalization
technique. In this paper we establish the consistency of an extended Bayesian
information criterion for Gaussian graphical models in a scenario where both the
number of variables p and the sample size n grow. Compared to earlier work on
the regression case, our treatment allows for growth in the number of non-zero
parameters in the true model, which is necessary in order to cover connected graphs.
We demonstrate the performance of this criterion on simulated data when used in
conjunction with the graphical lasso, and verify that the criterion indeed performs
better than either cross-validation or the ordinary Bayesian information criterion
when p and the number of non-zero parameters q both scale with n.

1 Introduction 

This paper is concerned with the problem of model selection (or structure learning) in Gaussian 
graphical modelling. A Gaussian graphical model for a random vector $X = (X_1, \dots, X_p)$ is
determined by a graph G on p nodes. The model comprises all multivariate normal distributions
$N_p(\mu, \Theta^{-1})$ whose inverse covariance matrix satisfies $\Theta_{jk} = 0$ when {j, k} is not an edge in G.
For background on these models, including a discussion of the conditional independence
interpretation of the graph, we refer the reader to [1].

In many applications, in particular in the analysis of gene expression data, inference of the graph G is 
of significant interest. Information criteria provide an important tool for this problem. They provide 
the objective to be minimized in (heuristic) searches over the space of graphs and are sometimes 
used to select tuning parameters in other methods such as the graphical lasso of [2]. In this work
we study an extended Bayesian information criterion (BIC) for Gaussian graphical models. Given a
sample of n independent and identically distributed observations, this criterion takes the form

$$\mathrm{BIC}_\gamma(E) = -2\, l_n(\hat\Theta(E)) + |E| \log n + 4 |E| \gamma \log p, \qquad (1)$$

where E is the edge set of a candidate graph and $l_n(\hat\Theta(E))$ denotes the maximized log-likelihood
function of the associated model. (In this context an edge set comprises unordered pairs {j, k} of
distinct elements in {1, ..., p}.) The criterion is indexed by a parameter $\gamma \in [0, 1]$; see the Bayesian
interpretation of $\gamma$ given in [3]. If $\gamma = 0$, then the classical BIC of [4] is recovered, which is
well known to lead to (asymptotically) consistent model selection in the setting of fixed number of
variables p and growing sample size n. Consistency is understood to mean selection of the smallest
true graph, whose edge set we denote $E_0$. Positive $\gamma$ leads to stronger penalization of large graphs,
and our main result states that the (asymptotic) consistency of an exhaustive search over a restricted
model space may then also hold in a scenario where p grows moderately with n (see the Main
Theorem in Section 2). Our numerical work demonstrates that positive values of $\gamma$ indeed lead to
improved graph inference when p and n are of comparable size (Section 3).
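As a minimal sketch of how criterion (1) is evaluated once a model has been fitted, the following Python function combines a maximized log-likelihood with the two penalty terms. The function name `ebic` and the numbers in the usage lines are hypothetical and serve only as an illustration.

```python
import math

def ebic(loglik, num_edges, n, p, gamma):
    """Extended BIC of equation (1): -2*l_n + |E| log n + 4 |E| gamma log p."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    return (-2.0 * loglik
            + num_edges * math.log(n)
            + 4.0 * num_edges * gamma * math.log(p))

# Hypothetical maximized log-likelihood and edge count, for illustration only.
bic0 = ebic(loglik=-520.0, num_edges=12, n=100, p=10, gamma=0.0)
ebic_half = ebic(loglik=-520.0, num_edges=12, n=100, p=10, gamma=0.5)
```

With gamma = 0 the function returns the classical BIC; any positive gamma adds a further penalty of 4 * gamma * log(p) per edge.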

The choice of the criterion in (1) is in analogy to a similar criterion for regression models that was
first proposed in [5] and theoretically studied in [3, 6]. Our theoretical study employs ideas from
these latter two papers as well as distribution theory available for decomposable graphical models.
As mentioned above, we treat an exhaustive search over a restricted model space that contains all
decomposable models given by an edge set of cardinality $|E| \le q$. One difference to the regression
treatment of [3, 6] is that we do not fix the dimension bound q nor the dimension $|E_0|$ of the smallest
true model. This is necessary for connected graphs to be covered by our work.

In practice, an exhaustive search is infeasible even for moderate values of p and q. Therefore, we 
must choose some method for preselecting a smaller set of models, each of which is then scored 
by applying the extended BIC (EBIC). Our simulations show that the combination of EBIC and 
graphical lasso gives good results well beyond the realm of the assumptions made in our theoretical 
analysis. This combination is consistent in settings where both the lasso and the exhaustive search 
are consistent but in light of the good theoretical properties of lasso procedures (see [7]), studying
this particular combination in itself would be an interesting topic for future work. 

2 Consistency of the extended BIC for Gaussian graphical models 

2.1 Notation and definitions 

In the sequel we make no distinction between the edge set E of a graph on p nodes and the
associated Gaussian graphical model. Without loss of generality we assume a zero mean vector for all
distributions in the model. We also refer to E as a set of entries in a $p \times p$ matrix, meaning the $2|E|$
entries indexed by (j, k) and (k, j) for each $\{j, k\} \in E$. We use $\Delta$ to denote the index pairs (j, j)
for the diagonal entries of the matrix.

Let $\Theta_0$ be a positive definite matrix supported on $\Delta \cup E_0$. In other words, the non-zero entries
of $\Theta_0$ are precisely the diagonal entries as well as the off-diagonal positions indexed by $E_0$; note
that a single edge in $E_0$ corresponds to two positions in the matrix due to symmetry. Suppose the
random vectors $X_1, \dots, X_n$ are independent and distributed identically according to $N(0, \Theta_0^{-1})$.
Let $S = \frac{1}{n} \sum_{i} X_i X_i^T$ be the sample covariance matrix. The Gaussian log-likelihood function
simplifies to

$$l_n(\Theta) = \frac{n}{2} \left[ \log\det(\Theta) - \mathrm{trace}(S\Theta) \right]. \qquad (2)$$
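The log-likelihood (2) can be evaluated directly from a precision matrix and the sample covariance matrix. The sketch below uses only the Python standard library and computes the log-determinant through a hand-written Cholesky factorization; the helper names (`cholesky_logdet`, `sample_covariance`, `gaussian_loglik`) are illustrative choices, not part of any package mentioned in the text.

```python
import math

def cholesky_logdet(M):
    """log det of a symmetric positive definite matrix via Cholesky factorization."""
    p = len(M)
    L = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            s = M[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    # det(M) = prod(L_ii)^2, so log det(M) = 2 * sum(log L_ii)
    return 2.0 * sum(math.log(L[i][i]) for i in range(p))

def sample_covariance(X):
    """S = (1/n) sum_i x_i x_i^T for zero-mean data; X is a list of n rows."""
    n, p = len(X), len(X[0])
    return [[sum(row[j] * row[k] for row in X) / n for k in range(p)]
            for j in range(p)]

def gaussian_loglik(theta, S, n):
    """l_n(Theta) = (n/2) [log det(Theta) - trace(S Theta)], as in equation (2)."""
    p = len(theta)
    trace_S_theta = sum(S[j][k] * theta[k][j] for j in range(p) for k in range(p))
    return 0.5 * n * (cholesky_logdet(theta) - trace_S_theta)
```

The Cholesky step doubles as a positive-definiteness check, which is convenient because (2) is only defined for positive definite arguments.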
We introduce some further notation. First, we define the maximum variance of the individual nodes:

$$\sigma^2_{\max} = \max_j\, (\Theta_0^{-1})_{jj}.$$

Next, we define $\theta_0 = \min_{e \in E_0} |(\Theta_0)_e|$, the minimum signal over the edges present in the graph.
(For edge e = {j, k}, let $(\Theta_0)_e = (\Theta_0)_{jk} = (\Theta_0)_{kj}$.) Finally, we write $\lambda_{\max}$ for the maximum
eigenvalue of $\Theta_0$. Observe that the product $\sigma^2_{\max} \lambda_{\max}$ is no larger than the condition number of $\Theta_0$
because $1/\lambda_{\min}(\Theta_0) = \lambda_{\max}(\Theta_0^{-1}) \ge \sigma^2_{\max}$.

2.2 Main result 

Suppose that n tends to infinity with the following asymptotic assumptions on data and model:

$$\begin{cases}
E_0 \text{ is decomposable, with } |E_0| \le q, \\
\sigma^2_{\max} \lambda_{\max} \le C, \\
p = O(n^\kappa), \quad p \to \infty, \\
\gamma_0 = \gamma - \left(1 - \frac{1}{4\kappa}\right) > 0, \\
(p + 2q) \log p \times \frac{\lambda^2_{\max}}{\theta_0^2} = o(n).
\end{cases} \qquad (3)$$

Here C, $\kappa > 0$ and $\gamma$ are fixed reals, while the integers p, q, the edge set $E_0$, the matrix $\Theta_0$, and
thus the quantities $\sigma^2_{\max}$, $\lambda_{\max}$ and $\theta_0$ are implicitly allowed to vary with n. We suppress this latter
dependence on n in the notation. The 'big oh' O(·) and the 'small oh' o(·) are the Landau symbols.
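The quantity $\gamma_0 = \gamma - (1 - \frac{1}{4\kappa})$ has a simple interpretation, sketched here under the simplifying assumption that $p = n^\kappa$ holds exactly (so that $\log n = \log p / \kappa$). Under that assumption the per-edge penalty of the extended BIC can be rewritten as:

```latex
\begin{align*}
\log n + 4\gamma \log p
  &= \frac{\log p}{\kappa} + 4\gamma \log p
   = 4\Bigl(\gamma + \frac{1}{4\kappa}\Bigr)\log p \\
  &= 4\Bigl(1 + \gamma - \Bigl(1 - \frac{1}{4\kappa}\Bigr)\Bigr)\log p
   = 4(1+\gamma_0)\log p .
\end{align*}
```

Read this way, requiring $\gamma_0 > 0$ amounts to requiring that each additional edge costs strictly more than $4 \log p$. For example, $\kappa = 1/2$ would need $\gamma > 1/2$, while any $\kappa \le 1/4$ is already covered by $\gamma = 0$.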
Main Theorem. Suppose that conditions (3) hold. Let $\mathcal{E}$ be the set of all decomposable models E
with $|E| \le q$. Then with probability tending to 1 as $n \to \infty$,

$$E_0 = \arg\min_{E \in \mathcal{E}} \mathrm{BIC}_\gamma(E).$$

That is, the extended BIC with parameter $\gamma$ selects the smallest true model $E_0$ when applied to any
subset of $\mathcal{E}$ containing $E_0$.

In order to prove this theorem we use two techniques for comparing likelihoods of different models.
Firstly, in Chen and Chen's work on the GLM case [6], the Taylor approximation to the
log-likelihood function is used and we will proceed similarly when comparing the smallest true model
$E_0$ to models E which do not contain $E_0$. The technique produces a lower bound on the decrease in
likelihood when the true model is replaced by a false model.

Theorem 1. Suppose that conditions (3) hold. Let $\mathcal{E}_1$ be the set of models E with $E \not\supset E_0$ and
$|E| \le q$. Then with probability tending to 1 as $n \to \infty$,

$$l_n(\Theta_0) - l_n(\hat\Theta(E)) > 2q (\log p)(1 + \gamma_0) \quad \forall\, E \in \mathcal{E}_1.$$

Secondly, Porteous [8] shows that in the case of two nested models which are both decomposable,
the likelihood ratio (at the maximum likelihood estimates) follows a distribution that can be
expressed exactly as a log product of Beta distributions. We will use this to address the comparison
between the model $E_0$ and decomposable models E containing $E_0$ and obtain an upper bound on
the improvement in likelihood when the true model is expanded to a larger decomposable model.

Theorem 2. Suppose that conditions (3) hold. Let $\mathcal{E}_0$ be the set of decomposable models E with
$E \supset E_0$ and $|E| \le q$. Then with probability tending to 1 as $n \to \infty$,

$$l_n(\hat\Theta(E)) - l_n(\hat\Theta(E_0)) < 2(1 + \gamma_0)(|E| - |E_0|) \log p \quad \forall\, E \in \mathcal{E}_0 \setminus \{E_0\}.$$



Proof of the Main Theorem. With probability tending to 1 as $n \to \infty$, both of the conclusions of
Theorems 1 and 2 hold. We will show that both conclusions holding simultaneously implies the
desired result.

Observe that $\mathcal{E} \subset \mathcal{E}_0 \cup \mathcal{E}_1$. Choose any $E \in \mathcal{E} \setminus \{E_0\}$. If $E \in \mathcal{E}_0$, then (by Theorem 2):

$$\mathrm{BIC}_\gamma(E) - \mathrm{BIC}_\gamma(E_0) = -2\,(l_n(\hat\Theta(E)) - l_n(\hat\Theta(E_0))) + 4(1 + \gamma_0)(|E| - |E_0|) \log p > 0.$$

If instead $E \in \mathcal{E}_1$, then (by Theorem 1, since $|E_0| \le q$):

$$\mathrm{BIC}_\gamma(E) - \mathrm{BIC}_\gamma(E_0) = -2\,(l_n(\hat\Theta(E)) - l_n(\hat\Theta(E_0))) + 4(1 + \gamma_0)(|E| - |E_0|) \log p > 0.$$

Therefore, for any $E \in \mathcal{E} \setminus \{E_0\}$, $\mathrm{BIC}_\gamma(E) > \mathrm{BIC}_\gamma(E_0)$, which yields the desired result. □

Some details on the proofs of Theorems 1 and 2 are given in Section 5.



3 Simulations 



In this section, we demonstrate that the EBIC with positive $\gamma$ indeed leads to better model selection
properties in practically relevant settings. We let n grow, set $p \propto n^\kappa$ for various values of $\kappa$, and
apply the EBIC with $\gamma \in \{0, 0.5, 1\}$, similarly to the choice made in the regression context by [3]. As
mentioned in the introduction, we first use the graphical lasso of [2] (as implemented in the 'glasso'
package for R) to define a small set of models to consider (details given below). From the selected
set we choose the model with the lowest EBIC. This is repeated for 100 trials for each combination
of values of n, p, $\gamma$ in each scaling scenario. For each case, the average positive selection rate (PSR)
and false discovery rate (FDR) are computed.
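The two summary rates can be computed from edge sets as in the sketch below. It assumes the usual definitions, which the text does not spell out: PSR is the fraction of true edges that are selected, and FDR is the fraction of selected edges that are not true edges. The function names and the example edge lists are hypothetical.

```python
def normalize(edges):
    """Represent unordered pairs {j, k} as frozensets so (j, k) and (k, j) coincide."""
    return {frozenset(e) for e in edges}

def psr_fdr(true_edges, selected_edges):
    """Return (PSR, FDR) for one selected model against the true edge set."""
    true_set, sel_set = normalize(true_edges), normalize(selected_edges)
    true_pos = len(true_set & sel_set)
    # Conventions for empty sets are an assumption made for this sketch.
    psr = true_pos / len(true_set) if true_set else 1.0
    fdr = (len(sel_set) - true_pos) / len(sel_set) if sel_set else 0.0
    return psr, fdr

# Chain on 4 nodes as the truth; the selected model has one false edge.
psr, fdr = psr_fdr([(1, 2), (2, 3), (3, 4)], [(2, 1), (2, 3), (1, 4)])
```

Averaging these pairs over repeated trials gives the per-scenario summaries described above.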

We recall that the graphical lasso places an $\ell_1$ penalty on the inverse covariance matrix. Given a
penalty $\rho \ge 0$, we obtain the estimate

$$\hat\Theta_\rho = \arg\min_{\Theta}\; -l_n(\Theta) + \rho \|\Theta\|_1. \qquad (4)$$
Figure 1: The chain (top) and the 'double chain' (bottom) on 6 nodes.



(Here we may define $\|\Theta\|_1$ as the sum of absolute values of all entries, or only of off-diagonal
entries; both variants are common.) The $\ell_1$ penalty promotes zeros in the estimated inverse covariance
matrix $\hat\Theta_\rho$; increasing the penalty yields an increase in sparsity. The 'glasso path', that is, the set
of models recovered over the full range of penalties $\rho \in [0, \infty)$, gives a small set of models which,
roughly, include the 'best' models at various levels of sparsity. We may therefore apply the EBIC to
this manageably small set of models (without further restriction to decomposable models).
Consistency results on the graphical lasso require the penalty $\rho$ to satisfy bounds that involve measures of
regularity in the unknown matrix $\Theta_0$; see [7]. Minimizing the EBIC can be viewed as a data-driven
method of tuning $\rho$, one that does not require creation of test data.
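The tuning step described here reduces to scoring each candidate model on the path and keeping the minimizer. The following sketch assumes that each candidate has already been refitted without penalty and is summarized by a hypothetical triple (penalty, maximized log-likelihood, number of edges); the numbers in the example path are made up for illustration.

```python
import math

def select_by_ebic(candidates, n, p, gamma):
    """Return the candidate (rho, loglik, num_edges) with the smallest EBIC."""
    def score(candidate):
        _, loglik, num_edges = candidate
        penalty_per_edge = math.log(n) + 4.0 * gamma * math.log(p)
        return -2.0 * loglik + num_edges * penalty_per_edge
    return min(candidates, key=score)

# Hypothetical path: larger penalties give sparser models with lower likelihood.
path = [(1.0, -600.0, 0), (0.3, -540.0, 9), (0.05, -530.0, 20)]
best_bic = select_by_ebic(path, n=100, p=10, gamma=0.0)
best_ebic = select_by_ebic(path, n=100, p=10, gamma=1.0)
```

In this made-up example the classical BIC keeps the nine-edge model, whereas the strongest extended penalty falls back to the empty graph, which illustrates how larger values of gamma favor sparser selections.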

While cross-validation does not generally have consistency properties for model selection (see [9]),
it is nevertheless interesting to compare our method to cross-validation. For the considered simulated 
data, we start with the set of models from the 'glasso path', as before, and then perform 100-fold 
cross-validation. For each model and each choice of training set and test set, we fit the model to 
the training set and then evaluate its performance on each sample in the test set, by measuring error 
in predicting each individual node conditional on the other nodes and then taking the sum of the 
squared errors. We note that this method is computationally much more intensive than the BIC or 
EBIC, because models need to be fitted many more times. 

3.1 Design 

In our simulations, we examine the EBIC as applied to the case where the graph is a chain with node
j being connected to nodes $j - 1, j + 1$, and to the 'double chain', where node j is connected to nodes
$j - 2, j - 1, j + 1, j + 2$. Figure 1 shows examples of the two types of graphs, which have on the
order of p and 2p edges, respectively. For both the chain and the double chain, we investigate four
different scaling scenarios, with the exponent $\kappa$ selected from {0.5, 0.9, 1, 1.1}. In each scenario,
we test n = 100, 200, 400, 800, and define $p \propto n^\kappa$ with the constant of proportionality chosen such
that p = 10 when n = 100 for better comparability.
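The scaling rule can be made concrete with a few lines of Python. The text fixes p = 10 at n = 100 but does not say how non-integer values are handled, so rounding to the nearest integer below is an assumption of this sketch, as is the function name `dimension`.

```python
def dimension(n, kappa, p0=10, n0=100):
    """p proportional to n**kappa with p = p0 at n = n0 (rounding is an assumption)."""
    return round(p0 * (n / n0) ** kappa)

# One row of dimensions per scaling exponent considered in the simulations.
table = {kappa: [dimension(n, kappa) for n in (100, 200, 400, 800)]
         for kappa in (0.5, 0.9, 1, 1.1)}
```

Under this rule the scenario with exponent 1 keeps the ratio p/n fixed at 1/10, while the exponent 1.1 lets p grow faster than n.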

In the case of a chain, the true inverse covariance matrix $\Theta_0$ is tridiagonal with all diagonal entries
$(\Theta_0)_{j,j}$ set equal to 1, and the entries $(\Theta_0)_{j,j+1} = (\Theta_0)_{j+1,j}$ that are next to the main diagonal
equal to 0.3. For the double chain, $\Theta_0$ has all diagonal entries equal to 1, the entries next to the main
diagonal are $(\Theta_0)_{j,j+1} = (\Theta_0)_{j+1,j} = 0.2$ and the remaining non-zero entries are
$(\Theta_0)_{j,j+2} = (\Theta_0)_{j+2,j} = 0.1$. In both cases, the choices result in values for $\theta_0$, $\sigma^2_{\max}$ and $\lambda_{\max}$ that are bounded
uniformly in the matrix size p.
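The two designs are easy to write down explicitly. The sketch below builds both precision matrices as nested Python lists with 0-based indices; the function names are illustrative.

```python
def chain_precision(p, off=0.3):
    """Tridiagonal Theta_0: ones on the diagonal, `off` next to the diagonal."""
    theta = [[0.0] * p for _ in range(p)]
    for j in range(p):
        theta[j][j] = 1.0
        if j + 1 < p:
            theta[j][j + 1] = theta[j + 1][j] = off
    return theta

def double_chain_precision(p, first=0.2, second=0.1):
    """Theta_0 for the double chain: `first` at lag 1 and `second` at lag 2."""
    theta = chain_precision(p, off=first)
    for j in range(p - 2):
        theta[j][j + 2] = theta[j + 2][j] = second
    return theta

def edge_set(theta):
    """Edges {j, k}, j < k, read off the non-zero off-diagonal entries."""
    p = len(theta)
    return {(j, k) for j in range(p) for k in range(j + 1, p) if theta[j][k] != 0.0}
```

In both designs the absolute off-diagonal entries in any row sum to at most 0.6, which is below the diagonal entry 1. The matrices are therefore strictly diagonally dominant and hence positive definite for every p, with eigenvalues confined to [0.4, 1.6] by Gershgorin's theorem.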

For each data set generated from $N(0, \Theta_0^{-1})$, we use the 'glasso' package [2] in R to compute the
'glasso path'. We choose 100 penalty values $\rho$ which are logarithmically evenly spaced between
$\rho_{\max}$ (the smallest value which will result in a no-edge model) and $\rho_{\max}/100$. At each penalty
value $\rho$, we compute $\hat\Theta_\rho$ from (4) and define the model $E_\rho$ based on this estimate's support. The R
routine also allows us to compute the unpenalized maximum likelihood estimate $\hat\Theta(E_\rho)$. We may
then readily compute the EBIC from (1). There is no guarantee that this procedure will find the
model with the lowest EBIC along the full 'glasso path', let alone among the space of all possible
models of size $\le q$. Nonetheless, it serves as a fast way to select a model without any manual tuning.
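A logarithmically spaced penalty grid of the kind described above can be generated as follows. The sketch takes the largest penalty as an input, since its value depends on the data and on the scaling of the objective; the function name and its defaults are illustrative.

```python
import math

def penalty_grid(rho_max, num=100, ratio=100.0):
    """`num` penalties, evenly spaced on the log scale from rho_max down to rho_max/ratio."""
    log_hi = math.log(rho_max)
    log_lo = math.log(rho_max / ratio)
    step = (log_lo - log_hi) / (num - 1)
    return [math.exp(log_hi + i * step) for i in range(num)]

grid = penalty_grid(2.0)  # hypothetical rho_max, for illustration only
```

Consecutive penalties differ by a constant factor, so the grid is denser in absolute terms near the small penalties, where the selected models tend to change more quickly.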

3.2 Results 

Chain graph: The results for the chain graph are displayed in Figure 2. The figure shows the positive
selection rate (PSR) and false discovery rate (FDR) in the four scaling scenarios. We observe that,
for the larger sample sizes, the recovery of the non-zero coefficients is perfect or nearly perfect for all
three values of $\gamma$; however, the FDR is noticeably better for the positive values of $\gamma$, especially
for higher scaling exponents $\kappa$. Therefore, for moderately large n, the EBIC with $\gamma = 0.5$ or $\gamma = 1$
performs very well, while the ordinary $\mathrm{BIC}_0$ produces a non-trivial number of false positives. For
100-fold cross-validation, while the PSR is initially slightly higher, the growing FDR demonstrates
the extreme inconsistency of this method in the given setting.

Double chain graph: The results for the double chain graph are displayed in Figure 3. In each
of the four scaling scenarios for this case, we see a noticeable decline in the PSR as $\gamma$ increases.
Nonetheless, for each value of $\gamma$, the PSR increases as n and p grow. Furthermore, the FDR for the
ordinary $\mathrm{BIC}_0$ is again noticeably higher than for the positive values of $\gamma$, and in the scaling
scenarios $\kappa > 0.9$, the FDR for $\mathrm{BIC}_0$ is actually increasing as n and p grow, suggesting that asymptotic
consistency may not hold in these cases, as is supported by our theoretical results. 100-fold
cross-validation shows significantly better PSR than the BIC and EBIC methods, but the FDR is again
extremely high and increases quickly as the model grows, which shows the unreliability of
cross-validation in this setting. Similarly to what Chen and Chen [3] conclude for the regression case,
it appears that the EBIC with parameter $\gamma = 0.5$ performs well. Although the PSR is necessarily
lower than with $\gamma = 0$, the FDR is quite low and decreasing as n and p grow, as desired.

For both types of simulations, the results demonstrate the trade-off inherent in choosing $\gamma$ in the
finite (non-asymptotic) setting. For low values of $\gamma$, we are more likely to obtain a good (high)
positive selection rate. For higher values of $\gamma$, we are more likely to obtain a good (low) false discovery
rate. (In the proofs given in Section 5, this corresponds to assumptions (5) and (6).) However,
asymptotically, the conditions (3) guarantee consistency, meaning that the trade-off becomes
irrelevant for large n and p. In the finite case, $\gamma = 0.5$ seems to be a good compromise in simulations, but
determining the best value of $\gamma$ in general settings remains an open question. Nonetheless,
this method offers guaranteed asymptotic consistency for (known) values of $\gamma$ depending only on n
and p.



4 Discussion 



We have proposed the use of an extended Bayesian information criterion for multivariate data
generated by sparse graphical models. Our main result gives a specific scaling for the number of variables
p, the sample size n, the bound on the number of edges q, and other technical quantities relating to
the true model, which will ensure asymptotic consistency. Our simulation study demonstrates
the practical potential of the extended BIC, particularly as a way to tune the graphical lasso. The
results show that the extended BIC with positive $\gamma$ gives strong improvement in false discovery rate
over the classical BIC, and even more so over cross-validation, while showing comparable positive
selection rate for the chain, where all the signals are fairly strong, and noticeably lower, but steadily
increasing, positive selection rate for the double chain with a large number of weaker signals.



5 Proofs 



We now sketch proofs of non-asymptotic versions of Theorems 1 and 2, which are formulated as
Theorems 3 and 4. (Full technical details are given in the Appendix.) We also give a non-asymptotic
formulation of the Main Theorem; see Theorem 5. In the non-asymptotic approach, we treat all
quantities as fixed (e.g. n, p, q, etc.) and state precise assumptions on those quantities, and then give
an explicit lower bound on the probability of the extended BIC recovering the model $E_0$ exactly.
We do this to give an intuition for the magnitude of the sample size n necessary for a good chance
of exact recovery in a given setting, but due to the proof techniques, the resulting implications about
sample size are extremely conservative.



5.1 Preliminaries 



We begin by stating two lemmas that are used in the proof of the main result, but are also more
generally interesting as tools for precise bounds on Gaussian and chi-square distributions. First, Cai
[10, Lemma 4] proves the following chi-square bound. For any $n \ge 1$, $\lambda > 0$,

$$P\{\chi^2_n > n(1 + \lambda)\} \le \frac{1}{\lambda\sqrt{\pi n}}\, e^{-\frac{n}{2}(\lambda - \log(1 + \lambda))}.$$
Figure 2: Simulation results when the true graph is a chain.



We can give an analogous left-tail upper bound. The proof is similar to Cai's proof and omitted here.
We will refer to these two bounds together as (CSB).

Lemma 1. For any $\lambda > 0$, for n such that $n \ge 4\lambda^{-2} + 1$,

$$P\{\chi^2_n < n(1 - \lambda)\} \le \frac{1}{\lambda\sqrt{\pi(n-1)}}\, e^{\frac{n-1}{2}(\lambda + \log(1 - \lambda))}.$$
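The two (CSB) bounds can be compared numerically with exact chi-square tail probabilities. The sketch below restricts attention to even degrees of freedom, where the upper tail has a closed form as a finite Poisson-type sum; it also restricts the left-tail bound to $\lambda \in (0, 1)$ so that the logarithm is defined. All function names are illustrative.

```python
import math

def upper_tail_bound(n, lam):
    """Right-tail bound on P{chi2_n > n(1 + lam)} quoted from Cai's lemma."""
    return math.exp(-0.5 * n * (lam - math.log1p(lam))) / (lam * math.sqrt(math.pi * n))

def lower_tail_bound(n, lam):
    """Left-tail bound of Lemma 1 on P{chi2_n < n(1 - lam)}."""
    if not (0.0 < lam < 1.0 and n >= 4.0 / lam ** 2 + 1.0):
        raise ValueError("need 0 < lam < 1 and n >= 4 / lam**2 + 1")
    exponent = 0.5 * (n - 1) * (lam + math.log1p(-lam))
    return math.exp(exponent) / (lam * math.sqrt(math.pi * (n - 1)))

def chi2_sf_even(n, x):
    """Exact P{chi2_n > x} for even n: exp(-x/2) * sum_{k < n/2} (x/2)^k / k!."""
    if n % 2 != 0:
        raise ValueError("closed form used here requires even n")
    term, total = 1.0, 0.0
    for k in range(n // 2):
        total += term
        term *= (x / 2.0) / (k + 1)
    return math.exp(-x / 2.0) * total

def chi2_cdf_even(n, x):
    """Exact P{chi2_n < x} for even n."""
    return 1.0 - chi2_sf_even(n, x)
```

For instance, one can compare `chi2_sf_even(50, 65.0)` with `upper_tail_bound(50, 0.3)` and `chi2_cdf_even(50, 35.0)` with `lower_tail_bound(50, 0.3)` to see how much slack the bounds leave at moderate n.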

Second, we give a distributional result about the sample correlation when sampling from a bivariate 
normal distribution. 

Lemma 2. Suppose $(X_1, Y_1), \dots, (X_n, Y_n)$ are independent draws from a bivariate normal
distribution with zero mean, variances equal to one and covariance $\rho$. Then the following distributional
equivalence holds, where A and B are independent $\chi^2_n$ variables:

$$\sum_{i=1}^{n} (X_i Y_i - \rho) \overset{D}{=} \frac{1+\rho}{2}(A - n) - \frac{1-\rho}{2}(B - n).$$

Proof. Let $A_1, B_1, A_2, B_2, \dots, A_n, B_n$ be independent standard normal random variables. Define:

$$X_i = \sqrt{\tfrac{1+\rho}{2}}\, A_i + \sqrt{\tfrac{1-\rho}{2}}\, B_i; \quad Y_i = \sqrt{\tfrac{1+\rho}{2}}\, A_i - \sqrt{\tfrac{1-\rho}{2}}\, B_i; \quad A = \sum_{i=1}^{n} A_i^2; \quad B = \sum_{i=1}^{n} B_i^2.$$

Then the variables $X_1, Y_1, X_2, Y_2, \dots, X_n, Y_n$ have the desired joint distribution, and A, B are
independent $\chi^2_n$ variables. The claim follows from writing $\sum_i X_i Y_i$ in terms of A and B. □
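The construction in this proof holds path by path, not only in distribution, which makes it easy to check numerically. The sketch below draws the standard normal variables with Python's `random` module and evaluates both sides of the identity; the function name and the chosen seed are arbitrary.

```python
import math
import random

def lemma2_sides(rho, n, rng):
    """Build (X_i, Y_i) from independent normals A_i, B_i and return both sides."""
    a = [rng.gauss(0.0, 1.0) for _ in range(n)]
    b = [rng.gauss(0.0, 1.0) for _ in range(n)]
    c_plus = math.sqrt((1.0 + rho) / 2.0)
    c_minus = math.sqrt((1.0 - rho) / 2.0)
    x = [c_plus * ai + c_minus * bi for ai, bi in zip(a, b)]
    y = [c_plus * ai - c_minus * bi for ai, bi in zip(a, b)]
    A = sum(ai * ai for ai in a)  # chi-square with n degrees of freedom
    B = sum(bi * bi for bi in b)  # independent chi-square with n degrees of freedom
    lhs = sum(xi * yi - rho for xi, yi in zip(x, y))
    rhs = (1.0 + rho) / 2.0 * (A - n) - (1.0 - rho) / 2.0 * (B - n)
    return lhs, rhs

lhs, rhs = lemma2_sides(rho=0.3, n=50, rng=random.Random(0))
```

The two returned values agree up to floating-point error, because $X_i Y_i = \frac{1+\rho}{2} A_i^2 - \frac{1-\rho}{2} B_i^2$ and the constants sum to $n\rho$.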
Figure 3: Simulation results when the true graph is a 'double chain'.



5.2 Non-asymptotic versions of the theorems 

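We assume the following two conditions, where $\epsilon_0, \epsilon_1 > 0$, $C \ge \sigma^2_{\max} \lambda_{\max}$, $\kappa = \log_n p$, and
$\gamma_0 = \gamma - (1 - \frac{1}{4\kappa})$:

$$\frac{(p + 2q) \log p}{n} \times \frac{\lambda^2_{\max}}{\theta_0^2} \le \frac{1}{3200 \max\{1 + \gamma_0,\ (1 + \frac{\epsilon_1}{2}) C^2\}} \qquad (5)$$

$$2\left(\sqrt{1 + \gamma_0} - 1\right) - \frac{\log\log p + \log(4\sqrt{1 + \gamma_0}) + 1}{2 \log p} \ge \epsilon_0 \qquad (6)$$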

Theorem 3. Suppose assumption (5) holds. Then with probability at least $1 - \frac{1}{\sqrt{\pi \log p}}\, p^{-\epsilon_1}$, for all
$E \not\supset E_0$ with $|E| \le q$,

$$l_n(\Theta_0) - l_n(\hat\Theta(E)) > 2q (\log p)(1 + \gamma_0).$$

Proof. We sketch a proof along the lines of the proof of Theorem 2 in [6], using Taylor series
centered at the true $\Theta_0$ to approximate the likelihood at $\hat\Theta(E)$. The score and the negative Hessian
of the log-likelihood function in (2) are

$$s_n(\Theta) = \frac{d}{d\Theta}\, l_n(\Theta) = \frac{n}{2}\left(\Theta^{-1} - S\right), \qquad H_n(\Theta) = -\frac{d}{d\Theta}\, s_n(\Theta) = \frac{n}{2}\, \Theta^{-1} \otimes \Theta^{-1}.$$

Here, the symbol $\otimes$ denotes the Kronecker product of matrices. Note that, while we require $\Theta$ to be
symmetric positive definite, this is not reflected in the derivatives above. We adopt this convention
for notational convenience in the sequel.
Next, observe that $\hat\Theta(E)$ has support on $\Delta \cup E_0 \cup E$, and that by definition of $\theta_0$, we have the lower
bound $|\hat\Theta(E) - \Theta_0|_F \ge \theta_0$ in terms of the Frobenius norm. By concavity of the log-likelihood
function, it suffices to show that the desired inequality holds for all $\Theta$ with support on $\Delta \cup E_0 \cup E$
with $|\Theta - \Theta_0|_F = \theta_0$. By Taylor expansion, for some $\tilde\Theta$ on the path from $\Theta_0$ to $\Theta$, we have:

$$l_n(\Theta) - l_n(\Theta_0) = \mathrm{vec}(\Theta - \Theta_0)^T s_n(\Theta_0) - \frac{1}{2}\, \mathrm{vec}(\Theta - \Theta_0)^T H_n(\tilde\Theta)\, \mathrm{vec}(\Theta - \Theta_0).$$

Next, by (CSB) and Lemma 2, with probability at least $1 - \frac{1}{\sqrt{\pi \log p}}\, e^{-\epsilon_1 \log p}$, the following bound
holds for all edges e in the complete graph (we omit the details):

$$(s_n(\Theta_0))_e^2 \le 6 \sigma^4_{\max} (2 + \epsilon_1)\, n \log p.$$

Now assume that this bound holds for all edges. Fix some E as above, and fix $\Theta$ with support on
$\Delta \cup E_0 \cup E$, with $|\Theta - \Theta_0|_F = \theta_0$. Note that the support has at most (p + 2q) entries. Therefore,

$$|\mathrm{vec}(\Theta - \Theta_0)^T s_n(\Theta_0)|^2 \le \theta_0^2 (p + 2q) \times 6 \sigma^4_{\max} (2 + \epsilon_1)\, n \log p.$$

Furthermore, the eigenvalues of $\tilde\Theta$ are bounded by $\lambda_{\max} + \theta_0 \le 2\lambda_{\max}$, and so by properties of
Kronecker products, the minimum eigenvalue of $H_n(\tilde\Theta)$ is at least $\frac{n}{2}(2\lambda_{\max})^{-2}$. We conclude that

$$l_n(\Theta) - l_n(\Theta_0) \le \sqrt{\theta_0^2 (p + 2q) \times 6 \sigma^4_{\max} (2 + \epsilon_1)\, n \log p} - \frac{1}{2}\theta_0^2 \times \frac{n}{2}(2\lambda_{\max})^{-2}.$$

Combining this bound with our assumptions above, we obtain the desired result. □

Theorem 4. Suppose additionally that assumption (6) holds (in particular, this implies that
$\gamma > 1 - \frac{1}{4\kappa}$). Then with probability at least $1 - \frac{1}{4\sqrt{\pi \log p}} \cdot \frac{p^{-\epsilon_0}}{1 - p^{-\epsilon_0}}$, for all decomposable models E such
that $E \supset E_0$ and $|E| \le q$,

$$l_n(\hat\Theta(E)) - l_n(\hat\Theta(E_0)) < 2(1 + \gamma_0)(|E| - |E_0|) \log p.$$

Proof. First, fix a single such model E, and define $m = |E| - |E_0|$. By [8, 11], $l_n(\hat\Theta(E)) - l_n(\hat\Theta(E_0))$
is distributed as $-\frac{n}{2} \log\left(\prod_{i=1}^{m} B_i\right)$, where $B_i \sim \mathrm{Beta}(\frac{n - c_i}{2}, \frac{1}{2})$ are independent random
variables and the constants $c_1, \dots, c_m$ are bounded by 1 less than the maximal clique size of the
graph given by model E, implying $c_i \le \sqrt{2q}$ for each i. Also shown in [8] is the stochastic inequality
$-\log(B_i) \le \frac{1}{n - c_i - 1}\chi^2_1$. It follows that, stochastically,

$$l_n(\hat\Theta(E)) - l_n(\hat\Theta(E_0)) \le \frac{n}{2} \times \frac{1}{n - \sqrt{2q} - 1}\, \chi^2_m.$$

Finally, combining the assumptions on n, p, q and the (CSB) inequalities, we obtain:

$$P\{l_n(\hat\Theta(E)) - l_n(\hat\Theta(E_0)) \ge 2(1 + \gamma_0)\, m \log p\} \le \frac{1}{4\sqrt{\pi \log p}}\, e^{-\frac{m}{2}\left(4(1 + \frac{\epsilon_0}{2}) \log p\right)}.$$

Next, note that the number of models E with $E \supset E_0$ and $|E| - |E_0| = m$ is bounded by $p^{2m}$.
Taking the union bound over all choices of m and all choices of E with that given m, we obtain that
the desired result holds with the desired probability. □
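The final union bound in this proof can be written out as a geometric series. The display below is a sketch that takes the per-model probability bound to be $\frac{1}{4\sqrt{\pi\log p}}\, e^{-\frac{m}{2} \cdot 4(1+\epsilon_0/2)\log p}$ and the number of models with $m$ extra edges to be at most $p^{2m}$, and that sums over all $m \ge 1$ as a convenient upper bound:

```latex
\begin{align*}
\sum_{m \ge 1} p^{2m} \cdot \frac{1}{4\sqrt{\pi \log p}}\,
    e^{-\frac{m}{2} \cdot 4\left(1 + \frac{\epsilon_0}{2}\right)\log p}
  &= \frac{1}{4\sqrt{\pi \log p}} \sum_{m \ge 1} p^{2m}\, p^{-2m - m\epsilon_0} \\
  &= \frac{1}{4\sqrt{\pi \log p}} \sum_{m \ge 1} p^{-m\epsilon_0}
   = \frac{1}{4\sqrt{\pi \log p}} \cdot \frac{p^{-\epsilon_0}}{1 - p^{-\epsilon_0}} .
\end{align*}
```

The factor $p^{2m}$ from counting models is cancelled exactly by the leading part of the exponent, so only the slack $\epsilon_0$ from assumption (6) remains to make the series summable.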

We are now ready to give a non-asymptotic version of the Main Theorem. For its proof apply the
union bound to the statements in Theorems 3 and 4, as in the asymptotic proof given in Section 2.

Theorem 5. Suppose assumptions (5) and (6) hold. Let $\mathcal{E}$ be the set of subsets E of edges
between the p nodes, satisfying $|E| \le q$ and representing a decomposable model. Then it holds with
probability at least $1 - \frac{1}{4\sqrt{\pi \log p}} \cdot \frac{p^{-\epsilon_0}}{1 - p^{-\epsilon_0}} - \frac{1}{\sqrt{\pi \log p}}\, p^{-\epsilon_1}$ that

$$E_0 = \arg\min_{E \in \mathcal{E}} \mathrm{BIC}_\gamma(E).$$

That is, the extended BIC with parameter $\gamma$ selects the smallest true model.

Finally, we note that translating the above to the asymptotic version of the result is simple. If the
conditions (3) hold, then for sufficiently large n (and thus sufficiently large p), assumptions (5) and
(6) hold. Furthermore, although we may not have the exact equality $\kappa = \log_n p$, we will have
$\log_n p \to \kappa$; this limit will be sufficient for the necessary inequalities to hold for sufficiently large
n. The proofs then follow from the non-asymptotic results.
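To get a feeling for how demanding assumptions (5) and (6) are, they can be evaluated numerically. The sketch below reads the two conditions as $\frac{(p+2q)\log p}{n} \cdot \frac{\lambda^2_{\max}}{\theta_0^2} \le \frac{1}{3200 \max\{1+\gamma_0, (1+\epsilon_1/2)C^2\}}$ and $2(\sqrt{1+\gamma_0}-1) - \frac{\log\log p + \log(4\sqrt{1+\gamma_0}) + 1}{2\log p} \ge \epsilon_0$. It treats $\gamma_0$ as a free input even though it is tied to n and p through $\kappa = \log_n p$. The function names and the example values are hypothetical.

```python
import math

def eps0_from_condition6(p, gamma0):
    """Largest eps0 compatible with condition (6); a negative value means (6) fails."""
    root = math.sqrt(1.0 + gamma0)
    log_p = math.log(p)
    correction = (math.log(log_p) + math.log(4.0 * root) + 1.0) / (2.0 * log_p)
    return 2.0 * (root - 1.0) - correction

def min_n_from_condition5(p, q, lam_max, theta0, C, gamma0, eps1):
    """Smallest real n for which condition (5) holds with the given quantities."""
    rhs = 1.0 / (3200.0 * max(1.0 + gamma0, (1.0 + eps1 / 2.0) * C ** 2))
    return (p + 2 * q) * math.log(p) * (lam_max / theta0) ** 2 / rhs

def theorem5_probability(p, eps0, eps1):
    """Lower bound on the probability of exact recovery as stated in Theorem 5."""
    root = math.sqrt(math.pi * math.log(p))
    term_supersets = p ** (-eps0) / (4.0 * root * (1.0 - p ** (-eps0)))
    term_false_models = p ** (-eps1) / root
    return 1.0 - term_supersets - term_false_models

# Hypothetical chain-like setting with p = 10 nodes and q = 9 edges.
n_needed = min_n_from_condition5(p=10, q=9, lam_max=1.6, theta0=0.3,
                                 C=2.0, gamma0=0.5, eps1=1.0)
```

Even for this small hypothetical example the sample size required by (5) runs into the tens of millions, in line with the remark that the non-asymptotic implications about sample size are extremely conservative.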
A Appendix

This section gives the proof for Lemma 1 and fills in the details of the proofs of Theorems 3 and 4.

Lemma 1. For any $\lambda \in (0, 1)$, for any $n \ge 4\lambda^{-2} + 1$,

$$P\{\chi^2_n < n(1 - \lambda)\} \le \frac{1}{\lambda\sqrt{\pi(n-1)}}\, e^{\frac{n-1}{2}(\lambda + \log(1 - \lambda))}.$$



Remark 1. We note that some lower bound on n is intuitively necessary in order to be able to bound
the 'left tail', because the mode of the $\chi^2_n$ distribution is at x = n - 2 (for $n \ge 2$). If $\lambda$ is very close
to zero, then the 'left tail' ($\chi^2_n \in [0, n(1 - \lambda)]$) actually includes the mode $x = n - 2 < n(1 - \lambda)$;
therefore, we could not hope to get an exponentially small probability for being in the tail. However,
this intuitive explanation suggests that we should have $n > O(\lambda^{-1})$; perhaps the bound in this
lemma could be tightened.

We first prove a preliminary lemma:

Lemma A.1. For any $\lambda > 0$, for any $n \ge 4\lambda^{-2} + 1$,

$$P\{\chi^2_{n+1} < (n + 1)(1 - \lambda)\} \le P\{\chi^2_n < n(1 - \lambda)\}.$$

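Proof. Let $f_n$ denote the density function for $\chi^2_n$, and let $\tilde f_n$ denote the density function for $\frac{1}{n}\chi^2_n$.
Then, using y = x/n, we get:

$$f_n(x) = \frac{1}{2^{n/2}\,\Gamma(n/2)}\, x^{n/2-1} e^{-x/2}, \qquad \tilde f_n(y) = \frac{n^{n/2}}{2^{n/2}\,\Gamma(n/2)}\, y^{n/2-1} e^{-ny/2}.$$

So,

$$\tilde f_{n+1}(y) = \tilde f_n(y) \times \sqrt{\frac{n+1}{2}}\; \frac{\Gamma(n/2)}{\Gamma((n+1)/2)} \left(\frac{n+1}{n}\right)^{n/2} \sqrt{y e^{-y}}.$$

First, note that $y e^{-y}$ is an increasing function for y < 1, and therefore

$$\forall\, y \in [0, 1-\lambda]: \quad y e^{-y} \le (1-\lambda)\, e^{-(1-\lambda)} \le e^{-1}\left(1 - \frac{\lambda^2}{2}\right).$$

(Here the last inequality is from the Taylor series.) Next, since $\log\Gamma(x)$ is a convex function (where
x > 0), and since $\Gamma((n+1)/2) = \Gamma((n-1)/2) \times \frac{n-1}{2}$, we see that

$$\frac{\Gamma((n+1)/2)}{\Gamma(n/2)} \ge \sqrt{\frac{n-1}{2}}.$$

Finally, it is a fact that $(1 + \frac{1}{n})^n \le e$. Putting the above bounds together, and assuming that
$y \in [0, 1 - \lambda]$, we obtain

$$\tilde f_{n+1}(y) \le \tilde f_n(y) \times \sqrt{\frac{n+1}{n-1}}\, \sqrt{1 - \frac{\lambda^2}{2}} = \tilde f_n(y) \times \left[\frac{n+1}{n-1}\left(1 - \frac{\lambda^2}{2}\right)\right]^{1/2}.$$

Since we require $n \ge 4\lambda^{-2} + 1$, the quantity in the brackets is at most 1, and so

$$\tilde f_{n+1}(y) \le \tilde f_n(y) \quad \forall\, y \in [0, 1 - \lambda].$$

Therefore,

$$P\left\{\tfrac{1}{n+1}\chi^2_{n+1} < (1 - \lambda)\right\} \le P\left\{\tfrac{1}{n}\chi^2_n < (1 - \lambda)\right\}.$$

□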
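The density comparison at the heart of this proof can be checked numerically. The sketch below evaluates the log-density of $\frac{1}{n}\chi^2_n$, which is $\frac{n}{2}\log\frac{n}{2} - \log\Gamma(\frac{n}{2}) + (\frac{n}{2} - 1)\log y - \frac{ny}{2}$, with `math.lgamma`, and compares consecutive degrees of freedom on a grid in $(0, 1 - \lambda]$. The function names and the grid size are illustrative.

```python
import math

def scaled_chi2_logpdf(n, y):
    """Log-density of chi2_n / n at y > 0."""
    half = 0.5 * n
    return half * math.log(half) - math.lgamma(half) + (half - 1.0) * math.log(y) - half * y

def density_is_dominated(n, lam, grid_size=200):
    """Check that the density for n + 1 is at most that for n on (0, 1 - lam]."""
    upper = 1.0 - lam
    points = [upper * (i + 1) / grid_size for i in range(grid_size)]
    return all(scaled_chi2_logpdf(n + 1, y) <= scaled_chi2_logpdf(n, y) + 1e-12
               for y in points)

# lam = 0.5 requires n >= 4 / lam**2 + 1 = 17 in the lemma.
ok_at_threshold = density_is_dominated(17, 0.5)
```

A grid check of this kind is of course no substitute for the proof, but it is a quick way to catch sign or exponent slips when reproducing the argument.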



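Now we prove Lemma 1.

Proof. First suppose that n is even. Let $f_n$ denote the density function of the $\chi^2_n$ distribution.
From [10], if n > 2,

$$P\{\chi^2_n < x\} = 1 - 2 f_n(x) - P\{\chi^2_{n-2} > x\} = -2 f_n(x) + P\{\chi^2_{n-2} < x\}.$$

Iterating this identity, we get

$$\begin{aligned}
P\{\chi^2_n < x\} &= P\{\chi^2_2 < x\} - 2 f_n(x) - 2 f_{n-2}(x) - \dots - 2 f_4(x) \\
&= 1 - e^{-x/2} - 2 \sum_{k=1}^{n/2-1} f_{2k+2}(x) \\
&= 1 - e^{-x/2} - 2 \sum_{k=1}^{n/2-1} \frac{1}{2^{k+1}\,\Gamma(k+1)}\, x^k e^{-x/2} \\
&= 1 - e^{-x/2} \sum_{k=0}^{n/2-1} \frac{(x/2)^k}{k!} \\
&= e^{-x/2} \sum_{k=n/2}^{\infty} \frac{(x/2)^k}{k!}.
\end{aligned}$$

Now set $x = n(1 - \lambda)$ for $\lambda \in (0, 1)$. We obtain

$$\begin{aligned}
P\{\chi^2_n < x\} &= e^{-\frac{n(1-\lambda)}{2}} \sum_{k=n/2}^{\infty} \frac{(n(1-\lambda)/2)^k}{k!} \\
&\le e^{-\frac{n(1-\lambda)}{2}}\, \frac{(n/2)^{n/2}}{(n/2)!} \sum_{k=n/2}^{\infty} (1 - \lambda)^k \\
&= e^{-\frac{n(1-\lambda)}{2}}\, \frac{(n/2)^{n/2}\,(1-\lambda)^{n/2}}{(n/2)!\; \lambda}.
\end{aligned}$$

By Stirling's formula,

$$\frac{(n/2)^{n/2}}{(n/2)!} \le \frac{e^{n/2}}{\sqrt{\pi n}},$$

and so,

$$P\{\chi^2_n < n(1 - \lambda)\} \le e^{-\frac{n(1-\lambda)}{2}}\, \frac{e^{n/2}\,(1-\lambda)^{n/2}}{\lambda\sqrt{\pi n}} = \frac{1}{\lambda\sqrt{\pi n}}\, e^{\frac{n}{2}(\lambda + \log(1 - \lambda))}.$$

This is clearly sufficient to prove the desired bound in the case that n is even. Next we turn to the
odd case; let n be odd. First observe that if $\lambda \ge 1$, the statement is trivial, while if $\lambda < 1$, then
$n \ge 4\lambda^{-2} + 1 > 5$, therefore n - 1 is positive. By Lemma A.1 and the expression above,

$$P\{\chi^2_n < n(1 - \lambda)\} \le P\{\chi^2_{n-1} < (n - 1)(1 - \lambda)\} \le \frac{1}{\lambda\sqrt{\pi(n-1)}}\, e^{\frac{n-1}{2}(\lambda + \log(1 - \lambda))}.$$

□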



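Next we turn to the theorems. Recall assumptions (5) and (6). Lemmas A.2 and A.3 below are sufficient
to fill in the details of Theorem 3.

Lemma A.2. With probability at least $1 - \frac{1}{\sqrt{\pi \log p}}\, e^{-\epsilon_1 \log p}$, the following holds for all edges e in
the complete graph:

$$(s_n(\Theta_0))_e^2 \le 6 \sigma^4_{\max} (2 + \epsilon_1)\, n \log p.$$

Proof. Fix some edge e = {j, k}. Write $\Sigma_0 = \Theta_0^{-1}$. Then

$$(s_n(\Theta_0))_{(j,k)} = \frac{n}{2}(\Sigma_0)_{jk} - \frac{n}{2} S_{jk} = -\frac{1}{2} \sum_{i=1}^{n} \left((X_j)_i (X_k)_i - (\Sigma_0)_{jk}\right).$$

Write $Y_j = ((\Sigma_0)_{jj})^{-1/2} X_j$, $Y_k = ((\Sigma_0)_{kk})^{-1/2} X_k$, $\rho = ((\Sigma_0)_{jj} (\Sigma_0)_{kk})^{-1/2} (\Sigma_0)_{jk} = \mathrm{corr}(Y_j, Y_k)$.
Then

$$(s_n(\Theta_0))_{(j,k)} = -\frac{1}{2} \sqrt{(\Sigma_0)_{jj} (\Sigma_0)_{kk}}\; \sum_{i=1}^{n} \left((Y_j)_i (Y_k)_i - \rho\right).$$

By Lemma 2, there are some independent $A, B \sim \chi^2_n$ such that

$$(s_n(\Theta_0))_{(j,k)} = -\frac{1}{2} \sqrt{(\Sigma_0)_{jj} (\Sigma_0)_{kk}} \left[\frac{1+\rho}{2}(A - n) - \frac{1-\rho}{2}(B - n)\right].$$

There are $\binom{p}{2} \le \frac{1}{2}p^2$ edges in the complete graph. Therefore, by the union bound, it will suffice to
show that, with probability at least $1 - (\frac{1}{2}p^2)^{-1} \frac{1}{\sqrt{\pi \log p}}\, e^{-\epsilon_1 \log p}$,

$$\frac{1}{4} (\Sigma_0)_{jj} (\Sigma_0)_{kk} \left[\frac{1+\rho}{2}(A - n) - \frac{1-\rho}{2}(B - n)\right]^2 \le 6 \sigma^4_{\max} (2 + \epsilon_1)\, n \log p.$$

Suppose this bound does not hold. Then

$$\left|\frac{1+\rho}{2}(A - n)\right| > \sqrt{6(2 + \epsilon_1)\, n \log p} \quad \text{or} \quad \left|\frac{1-\rho}{2}(B - n)\right| > \sqrt{6(2 + \epsilon_1)\, n \log p}.$$

Since $\rho \in [-1, 1]$, this implies that

$$|A - n| > \sqrt{6(2 + \epsilon_1)\, n \log p} \quad \text{or} \quad |B - n| > \sqrt{6(2 + \epsilon_1)\, n \log p}.$$

Since $A \overset{D}{=} B$, it will suffice to show that with probability at least $1 - p^{-2} \frac{1}{\sqrt{\pi \log p}}\, e^{-\epsilon_1 \log p}$,

$$|A - n| \le \sqrt{6(2 + \epsilon_1)\, n \log p}.$$

Write $\lambda = \sqrt{6(2 + \epsilon_1) \frac{\log p}{n}}$. Observe that, by assumption (5), $\lambda \le \frac{1}{2}$ and $n \ge 3$; therefore (by
Taylor series),

$$\frac{n}{2}(\lambda - \log(1 + \lambda)) \ge \frac{n}{2}\left(\frac{\lambda^2}{2} - \frac{\lambda^3}{3}\right) \ge \frac{n}{2} \cdot \frac{\lambda^2}{3} = (2 + \epsilon_1) \log p, \quad \text{and}$$

$$-\frac{n-1}{2}(\lambda + \log(1 - \lambda)) \ge \frac{n-1}{2} \cdot \frac{\lambda^2}{2} \ge \frac{n}{2} \cdot \frac{\lambda^2}{3} = (2 + \epsilon_1) \log p.$$

Furthermore,

$$\lambda\sqrt{n - 1} = \sqrt{6(2 + \epsilon_1) \log p}\; \sqrt{\frac{n-1}{n}} \ge \sqrt{\log p}.$$

By (CSB) from the paper,

$$P\{A - n > \sqrt{6(2 + \epsilon_1)\, n \log p}\} = P\{A > n(1 + \lambda)\} \le \frac{1}{\lambda\sqrt{\pi n}}\, e^{-\frac{n}{2}(\lambda - \log(1 + \lambda))} \le \frac{1}{\lambda\sqrt{\pi(n-1)}}\, e^{-(2 + \epsilon_1)\log p} \le \frac{1}{\sqrt{\pi \log p}}\, e^{-(2 + \epsilon_1)\log p},$$

and also,

$$P\{A - n < -\sqrt{6(2 + \epsilon_1)\, n \log p}\} = P\{A < n(1 - \lambda)\} \le \frac{1}{\lambda\sqrt{\pi(n-1)}}\, e^{\frac{n-1}{2}(\lambda + \log(1 - \lambda))} \le \frac{1}{\lambda\sqrt{\pi(n-1)}}\, e^{-(2 + \epsilon_1)\log p} \le \frac{1}{\sqrt{\pi \log p}}\, e^{-(2 + \epsilon_1)\log p}.$$

This gives the desired result. □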
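Lemma A.3. Recall that, in the proof of Theorem 3, we showed that

$$l_n(\Theta) - l_n(\Theta_0) \le \sqrt{\theta_0^2 (p + 2q) \times 6 \sigma^4_{\max} (2 + \epsilon_1)\, n \log p} - \frac{1}{2}\theta_0^2 \times \frac{n}{2}(2\lambda_{\max})^{-2}.$$

Then this implies that

$$l_n(\Theta) - l_n(\Theta_0) < -2q (\log p)(1 + \gamma_0).$$

Proof. It is sufficient to show that

$$\sqrt{\theta_0^2 (p + 2q) \times 6 \sigma^4_{\max} (2 + \epsilon_1)\, n \log p} - \frac{1}{2}\theta_0^2 \times \frac{n}{2}(2\lambda_{\max})^{-2} < -(p + 2q)(\log p)(1 + \gamma_0).$$

We rewrite this as

$$\sqrt{A \times n^2 \theta_0^4 \lambda_{\max}^{-4} \times 6 \sigma^4_{\max} \lambda^2_{\max} (2 + \epsilon_1)} - \frac{1}{2}\theta_0^2 \times \frac{n}{2}(2\lambda_{\max})^{-2} < -A \times n \times \theta_0^2 \lambda_{\max}^{-2} (1 + \gamma_0),$$

where

$$A = \frac{(p + 2q) \log p}{n} \times \frac{\lambda^2_{\max}}{\theta_0^2}.$$

Using $C \ge \sigma^2_{\max} \lambda_{\max}$, it's sufficient to show that

$$\sqrt{A \times n^2 \theta_0^4 \lambda_{\max}^{-4} \times 6 C^2 (2 + \epsilon_1)} - \frac{1}{2}\theta_0^2 \times \frac{n}{2}(2\lambda_{\max})^{-2} < -A \times n \times \theta_0^2 \lambda_{\max}^{-2} (1 + \gamma_0).$$

Dividing out common factors, the above is equivalent to showing that

$$\sqrt{A \times 6 C^2 (2 + \epsilon_1)} - \frac{1}{16} < -A \times (1 + \gamma_0).$$

By assumption (5), we know:

$$A \times (1 + \gamma_0) \le \frac{1}{3200},$$

and also,

$$A \times 6 C^2 (2 + \epsilon_1) \le 12 \times \frac{1}{3200}.$$

Therefore,

$$A \times (1 + \gamma_0) + \sqrt{A \times 6 C^2 (2 + \epsilon_1)} \le \frac{1}{3200} + \sqrt{\frac{12}{3200}} < \frac{1}{16},$$

as desired. □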

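Lemma A.4 below is sufficient to fill in the details of Theorem 4.

Lemma A.4. Recall that, in the proof of Theorem 4, we showed that, stochastically,

$$l_n(\hat\Theta(E)) - l_n(\hat\Theta(E_0)) \le \frac{n}{2} \times \frac{1}{n - \sqrt{2q} - 1}\, \chi^2_m.$$

Then this implies that

$$P\{l_n(\hat\Theta(E)) - l_n(\hat\Theta(E_0)) \ge 2(1 + \gamma_0)\, m \log p\} \le \frac{1}{4\sqrt{\pi \log p}}\, e^{-\frac{m}{2} \cdot 4(1 + \frac{\epsilon_0}{2}) \log p}.$$

Proof. First, we show that $\frac{n - \sqrt{2q} - 1}{n} \ge (1 + \gamma_0)^{-1/2}$. By assumption (6) we see that:

$$\frac{1}{\log p} \le 4\left(\sqrt{1 + \gamma_0} - 1\right).$$

Now turn to assumption (5). We see that the right-hand side of (5) is $\le \frac{1}{4\sqrt{1 + \gamma_0}}$. On the left-hand
side of (5), by definition, $\lambda^2_{\max} \ge \theta_0^2$. Therefore,

$$\frac{(p + 2q) \log p}{n} \le \frac{1}{4\sqrt{1 + \gamma_0}}.$$

Therefore,

$$\frac{\sqrt{2q} + 1}{n} \le \frac{p + 2q}{n} \le \frac{4\left(\sqrt{1 + \gamma_0} - 1\right)}{4\sqrt{1 + \gamma_0}} = 1 - \frac{1}{\sqrt{1 + \gamma_0}},$$

and so,

$$\frac{n - \sqrt{2q} - 1}{n} \ge (1 + \gamma_0)^{-1/2}.$$

Therefore, using the stochastic inequality in the statement of the lemma,

$$\begin{aligned}
P\{l_n(\hat\Theta(E)) - l_n(\hat\Theta(E_0)) \ge 2(1 + \gamma_0)\, m \log p\}
&\le P\left\{\chi^2_m \ge 4(1 + \gamma_0)\, m \log p \times \frac{n - \sqrt{2q} - 1}{n}\right\} \\
&\le P\left\{\chi^2_m \ge 4\sqrt{1 + \gamma_0}\; m \log p\right\}.
\end{aligned}$$

Now we apply the chi-square bound from [10], and obtain that

$$P\left\{\chi^2_m \ge 4\sqrt{1 + \gamma_0}\; m \log p\right\} \le \frac{1}{\left(4\sqrt{1 + \gamma_0} \log p - 1\right)\sqrt{\pi m}}\, e^{-\frac{m}{2}\left(4\sqrt{1 + \gamma_0} \log p - 1 - \log(4\sqrt{1 + \gamma_0} \log p)\right)}.$$

Since $m \ge 1$ and $\frac{1}{\log p} \le 4(\sqrt{1 + \gamma_0} - 1)$, we obtain that the upper bound is at most

$$\begin{aligned}
&\frac{1}{4\sqrt{\pi \log p}}\, e^{-\frac{m}{2}\left(4\sqrt{1 + \gamma_0} \log p - 1 - \log(4\sqrt{1 + \gamma_0} \log p)\right)} \\
&\quad = \frac{1}{4\sqrt{\pi \log p}}\, e^{-\frac{m}{2}\left(4\sqrt{1 + \gamma_0} \log p - (\log\log p + \log(4\sqrt{1 + \gamma_0}) + 1)\right)} \\
&\quad = \frac{1}{4\sqrt{\pi \log p}}\, e^{-\frac{m}{2}(2 \log p)\left(2\sqrt{1 + \gamma_0} - \frac{\log\log p + \log(4\sqrt{1 + \gamma_0}) + 1}{2 \log p}\right)}.
\end{aligned}$$

By assumption (6), we may further bound this expression from above as

$$\frac{1}{4\sqrt{\pi \log p}}\, e^{-\frac{m}{2}(2 \log p)(2 + \epsilon_0)} = \frac{1}{4\sqrt{\pi \log p}}\, e^{-\frac{m}{2} \cdot 4(1 + \frac{\epsilon_0}{2}) \log p}.$$

□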